Exploration server on OCR

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.

Internal identifier: 000172 (Main/Exploration); previous: 000171; next: 000173

Authors: Cyril Grouin [France]; Pierre Zweigenbaum

Source: Studies in health technology and informatics [ISSN 0926-9630]; 2013

RBID : pubmed:23920600

French descriptors

English descriptors

Abstract

In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.
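The exact-match F-measure reported in the abstract can be illustrated with a minimal sketch. It assumes that "exact match" means a predicted span counts only when its boundaries and its category both coincide with a gold annotation, and it uses the usual definition F = 2PR / (P + R). The regular expressions are illustrative assumptions for three of the numerical categories (dates, phone numbers, zip codes); they are not the rules used by the authors, and the sample sentence and gold annotations are invented for the example.

```python
import re

# Illustrative patterns only (assumptions, not the paper's actual rules).
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),    # e.g. 12/03/2012
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),    # e.g. 01 23 45 67 89
    "ZIP": re.compile(r"\b\d{5}\b"),                     # e.g. 75013
}


def tag(text):
    """Return the set of (start, end, label) spans matched by the patterns."""
    spans = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.add((match.start(), match.end(), label))
    return spans


def exact_match_f1(gold, predicted):
    """F-measure where a prediction is correct only if span and label both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def gold_span(text, surface, label):
    """Build a gold annotation from the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return (start, start + len(surface), label)


# Invented sample sentence and gold annotations, for illustration only.
TEXT = "Seen on 12/03/2012 in 75013 Paris, tel. 01 23 45 67 89."
GOLD = {
    gold_span(TEXT, "12/03/2012", "DATE"),
    gold_span(TEXT, "75013", "ZIP"),
    gold_span(TEXT, "Paris", "TOWN"),  # no pattern covers towns, so this one is missed
    gold_span(TEXT, "01 23 45 67 89", "PHONE"),
}

if __name__ == "__main__":
    predicted = tag(TEXT)
    print(f"exact-match F-measure: {exact_match_f1(GOLD, predicted):.3f}")
```

The missed town in the sample mirrors the abstract's observation that pattern-style rules suit numerical data, while categories such as towns or hospital names need other resources or a learned model.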

PubMed: 23920600


Affiliations:


Links toward previous steps (curation, corpus...)


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author>
<name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1">
<nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName>
<region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23920600</idno>
<idno type="pmid">23920600</idno>
<idno type="wicri:Area/PubMed/Corpus">000021</idno>
<idno type="wicri:Area/PubMed/Curation">000021</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000021</idno>
<idno type="wicri:Area/Ncbi/Merge">000169</idno>
<idno type="wicri:Area/Ncbi/Curation">000169</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000169</idno>
<idno type="wicri:doubleKey">0926-9630:2013:Grouin C:automatic:de:identification</idno>
<idno type="wicri:Area/Main/Merge">000175</idno>
<idno type="wicri:Area/Main/Curation">000172</idno>
<idno type="wicri:Area/Main/Exploration">000172</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author>
<name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1">
<nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName>
<region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</analytic>
<series>
<title level="j">Studies in health technology and informatics</title>
<idno type="ISSN">0926-9630</idno>
<imprint>
<date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Data Mining (methods)</term>
<term>Electronic Health Records</term>
<term>France</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="MESH" type="geographic" xml:lang="en">
<term>France</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Data Mining</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Electronic Health Records</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="Wicri" type="geographic" xml:lang="fr">
<term>France</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Île-de-France</li>
</region>
<settlement>
<li>Orsay</li>
</settlement>
</list>
<tree>
<noCountry>
<name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</noCountry>
<country name="France">
<region name="Île-de-France">
<name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
</region>
</country>
</tree>
</affiliations>
</record>
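As a rough illustration of how such a record could be read outside Dilib, the sketch below extracts the title, authors, and PubMed identifier with Python's standard `xml.etree.ElementTree`. The record uses the `wicri:` and `nlm:` prefixes without declaring them, which a strict XML parser rejects, so the sketch binds them to placeholder URIs before parsing. Those URIs, the helper names `parse_record` and `summarize`, and the trimmed `SAMPLE` string are assumptions made for the example, not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the record does not declare
# the wicri: and nlm: prefixes, so any binding will do for parsing.
NAMESPACES = {"wicri": "urn:example:wicri", "nlm": "urn:example:nlm"}


def parse_record(xml_text):
    """Parse a record after declaring the undeclared prefixes on <record>."""
    declarations = " ".join(
        f'xmlns:{prefix}="{uri}"' for prefix, uri in NAMESPACES.items()
    )
    patched = xml_text.replace("<record>", f"<record {declarations}>", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Collect title, (author, affiliation) pairs, and PMID from the TEI header."""
    file_desc = record.find("./TEI/teiHeader/fileDesc")
    authors = []
    for author in file_desc.findall("./titleStmt/author"):
        name = author.findtext("name")
        affiliation = author.findtext(
            "affiliation/nlm:affiliation", namespaces=NAMESPACES
        )
        authors.append((name, affiliation))  # affiliation is None when absent
    return {
        "title": file_desc.findtext("./titleStmt/title"),
        "authors": authors,
        "pmid": file_desc.findtext("./publicationStmt/idno[@type='pmid']"),
    }


# Trimmed copy of the record above, kept short for the example.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author>
<name>Cyril Grouin</name>
<affiliation wicri:level="1">
<nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
</affiliation>
</author>
<author>
<name>Pierre Zweigenbaum</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">pubmed:23920600</idno>
<idno type="pmid">23920600</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

if __name__ == "__main__":
    print(summarize(parse_record(SAMPLE)))
```

Patching the root tag is a shortcut that relies on the record starting with a bare `<record>` element; a lenient parser would be an alternative if that does not hold.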

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000172 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000172 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     pubmed:23920600
   |texte=   Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:23920600" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024